Skip to content

ARROW-15545: [Python][C++] Support casting to extension type#14106

Merged
jorisvandenbossche merged 27 commits into
apache:masterfrom
milesgranger:ARROW-15545_cast-of-extension-types
Oct 11, 2022
Merged

ARROW-15545: [Python][C++] Support casting to extension type#14106
jorisvandenbossche merged 27 commits into
apache:masterfrom
milesgranger:ARROW-15545_cast-of-extension-types

Conversation

@milesgranger

@milesgranger milesgranger commented Sep 13, 2022

Copy link
Copy Markdown
Contributor

Fixes ARROW-15545, fixes ARROW-14500

@github-actions

Copy link
Copy Markdown

@jorisvandenbossche

Copy link
Copy Markdown
Member

I think ideally we handle this on the C++ side, so that it for example also works within a Acero query / dataset scan.

@rok

rok commented Sep 13, 2022

Copy link
Copy Markdown
Member

This would indeed be useful in c++ too. Reading scalar_cast_numeric.cc and scalar_cast_test.cc might be a good way to get familiar with casting there.

@milesgranger

milesgranger commented Sep 13, 2022

Copy link
Copy Markdown
Contributor Author

Thanks @rok, just saw your comment, I'll take a deeper look at that. Now in b9d8cb6 I loop over id or storage id. This creates the same result as the Python casting in e4db00b which fixes the reported issue (dictionary casting ends up as the ExtensionArray UuidType). However, when casting the test's IntegerType it comes back as Int64Array instead of extension array.. (maybe that's okay?)

@milesgranger milesgranger changed the title ARROW-15545: [Python] Support casting to extension type ARROW-15545: [Python][C++] Support casting to extension type Sep 13, 2022
@rok

rok commented Sep 13, 2022

Copy link
Copy Markdown
Member

@milesgranger My guess is you will want to instantiate an ExtensionArray from a 'regular' Array by passing the underlaying buffers, kind of like what CastFromExtension does and then add c++ test like this one.

Comment thread cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_numeric.cc Outdated
@milesgranger

Copy link
Copy Markdown
Contributor Author

@pitrou (and others already subscribed), this is ready for a look now. Couple things I'm not sure about is if I'm adding the supported types in GetCastToExtension correctly and if I should better the C++ tests to cover more type varieties like the Python tests do.

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @milesgranger . Some comments below.

Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_internal.h Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_test.cc
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_test.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_test.cc
@pitrou

pitrou commented Sep 19, 2022

Copy link
Copy Markdown
Member

Also cc @lidavidm for potential opinions.

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't think I have much to add to Antoine's comments here.

It may be good to add an entry to "Generic conversions" in the docs: https://github.com/apache/arrow/blob/master/docs/source/cpp/compute.rst#conversions

Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc Outdated
Comment thread docs/source/cpp/compute.rst Outdated
@lidavidm

Copy link
Copy Markdown
Member

In general that sounds interesting. Especially if that would also allow to change the behaviour of a specific cast (and not just allowing a pre-defined cast to/from the storage type). For example casting to string could give some actual string representation instead of just the string representation of the storage.
But at the same time this also sounds quite complicated for a maybe limited use case?

Cast-to-string is how things like the CSV writer work, right? So maybe the use case isn't all that limited?

@mavam

mavam commented Sep 29, 2022

Copy link
Copy Markdown
Contributor

Glad to see this being worked on!

I'm currently running in the not-yet-implemented error, per https://issues.apache.org/jira/browse/ARROW-17839, where I have a nested struct with an extension type.

Comment on lines 37 to 39

@jorisvandenbossche jorisvandenbossche Sep 29, 2022

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My interpretation was to (for now) disallow any cast unless the input array has exactly a type equal to the target storage type (in which case basically no actual cast is needed, only the creation of the extension array ..)

So something like if (!array->type()->Equals(out_ty)) { ...

(the error message you now wrote would cover that as well)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah okay.. then it's updated, and also does not cast unless it's needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For future reference, Miles and I chatted about this and were thinking to keep the above, and thus only disallow Extension->Extension casts for now in this PR, while still allow Any->Extension casts based on the storage type casting rules (so if Any can be cast to storage type, then it can also be cast to the Extension type).

This would make it consistent with the existing Extension->Any cast, which also just defers to the casting rules of the storage type at the moment (and does not limit to strictly casting Extension only to the exact storage type). And I think this is also what others were OK with (based on the inline discussion in #14106 (comment))

Personally I still think it's a bit inconsistent to allow casting any unrelated type to ExtensionType, but not allow an unrelated extension array to ExtensionType. But we can see to allow registering such casts in the follow-up https://issues.apache.org/jira/browse/ARROW-17890

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implicit conversions always have to end up inconsistent in practice, so this is not a problem.

For example, in Python you can convert a dict to bool, a bool to float, but not a dict to float.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but you can use that analogy for the "any -> extension" casts as well (for the case where "any" is not equal to "storage"), that the user needs to do "any -> storage -> extension" explicitly instead .

Well, I wouldn't mind this :-)

Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc
@jorisvandenbossche

Copy link
Copy Markdown
Member

I opened a follow-up JIRA about allowing more casts / ability to register casts -> https://issues.apache.org/jira/browse/ARROW-17890

Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc Outdated
Comment thread docs/source/cpp/compute.rst Outdated
Comment thread python/pyarrow/tests/test_extension_type.py Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_cast_extension.cc
@milesgranger

Copy link
Copy Markdown
Contributor Author

Is there anything else we ought to address?
I don't think the two failing jobs are related.

@jorisvandenbossche jorisvandenbossche left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the last updates @milesgranger !

Planning to merge this later today (the failures seem unrelated)

@jorisvandenbossche jorisvandenbossche merged commit 06c99f7 into apache:master Oct 11, 2022
@jorisvandenbossche

Copy link
Copy Markdown
Member

Thanks @milesgranger !

@milesgranger milesgranger deleted the ARROW-15545_cast-of-extension-types branch October 11, 2022 08:49
@ursabot

ursabot commented Oct 11, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = 5fb02bd and contender = 06c99f7. 06c99f7 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️1.11% ⬆️0.0%] test-mac-arm
[Failed ⬇️0.82% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.32% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] 06c99f7a ec2-t3-xlarge-us-east-2
[Failed] 06c99f7a test-mac-arm
[Failed] 06c99f7a ursa-i9-9960x
[Finished] 06c99f7a ursa-thinkcentre-m75q
[Finished] 5fb02bdb ec2-t3-xlarge-us-east-2
[Failed] 5fb02bdb test-mac-arm
[Failed] 5fb02bdb ursa-i9-9960x
[Finished] 5fb02bdb ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@ursabot

ursabot commented Oct 11, 2022

Copy link
Copy Markdown

['Python', 'R'] benchmarks have high level of regressions.
ursa-i9-9960x

@jorisvandenbossche

Copy link
Copy Markdown
Member

FWIW the regressions all seems to be flaky runs (they didn't continue to be slower on the subsequent merged commits)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants